OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Identifying the Original Contribution of a Document via Language Modeling

Internal identifier: 000828 (Main/Exploration); previous: 000827; next: 000829

Authors: Benyah Shaparenko [United States]; Thorsten Joachims [United States]

RBID : ISTEX:76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD

Abstract

One major goal of text mining is to provide automatic methods to help humans grasp the key ideas in ever-increasing text corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and the model is used to identify each document’s most original passages. Unlike heuristic approaches, the statistical model is extensible and open to analysis. We evaluate the approach both on synthetic data and on real data in the domains of research publications and news, showing that the passage impact model outperforms a heuristic baseline method.

DOI: 10.1007/978-3-642-04174-7_23


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct:series">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Identifying the Original Contribution of a Document via Language Modeling</title>
<author>
<name sortKey="Shaparenko, Benyah" sort="Shaparenko, Benyah" uniqKey="Shaparenko B" first="Benyah" last="Shaparenko">Benyah Shaparenko</name>
</author>
<author>
<name sortKey="Joachims, Thorsten" sort="Joachims, Thorsten" uniqKey="Joachims T" first="Thorsten" last="Joachims">Thorsten Joachims</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-04174-7_23</idno>
<idno type="url">https://api.istex.fr/document/76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001106</idno>
<idno type="wicri:Area/Istex/Curation">001051</idno>
<idno type="wicri:Area/Istex/Checkpoint">000350</idno>
<idno type="wicri:doubleKey">0302-9743:2009:Shaparenko B:identifying:the:original</idno>
<idno type="wicri:Area/Main/Merge">000836</idno>
<idno type="wicri:Area/Main/Curation">000828</idno>
<idno type="wicri:Area/Main/Exploration">000828</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Identifying the Original Contribution of a Document via Language Modeling</title>
<author>
<name sortKey="Shaparenko, Benyah" sort="Shaparenko, Benyah" uniqKey="Shaparenko B" first="Benyah" last="Shaparenko">Benyah Shaparenko</name>
<affiliation wicri:level="4">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science, Cornell University, 14853, Ithaca, NY</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
<settlement type="city">Ithaca (New York)</settlement>
</placeName>
<orgName type="university">Université Cornell</orgName>
</affiliation>
</author>
<author>
<name sortKey="Joachims, Thorsten" sort="Joachims, Thorsten" uniqKey="Joachims T" first="Thorsten" last="Joachims">Thorsten Joachims</name>
<affiliation wicri:level="4">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science, Cornell University, 14853, Ithaca, NY</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
<settlement type="city">Ithaca (New York)</settlement>
</placeName>
<orgName type="university">Université Cornell</orgName>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2009</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD</idno>
<idno type="DOI">10.1007/978-3-642-04174-7_23</idno>
<idno type="ChapterID">23</idno>
<idno type="ChapterID">Chap23</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: One major goal of text mining is to provide automatic methods to help humans grasp the key ideas in ever-increasing text corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and the model is used to identify each document’s most original passages. Unlike heuristic approaches, the statistical model is extensible and open to analysis. We evaluate the approach both on synthetic data and on real data in the domains of research publications and news, showing that the passage impact model outperforms a heuristic baseline method.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>État de New York</li>
</region>
<settlement>
<li>Ithaca (New York)</li>
</settlement>
<orgName>
<li>Université Cornell</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="État de New York">
<name sortKey="Shaparenko, Benyah" sort="Shaparenko, Benyah" uniqKey="Shaparenko B" first="Benyah" last="Shaparenko">Benyah Shaparenko</name>
</region>
<name sortKey="Joachims, Thorsten" sort="Joachims, Thorsten" uniqKey="Joachims T" first="Thorsten" last="Joachims">Thorsten Joachims</name>
</country>
</tree>
</affiliations>
</record>
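
The fields of a record like the one above can be read with a standard XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a trimmed excerpt of the record. The function name `summarize` is hypothetical and is not part of Dilib or Wicri. The excerpt leaves out the `wicri:`-prefixed attributes and elements, because that prefix is not declared in the record as shown and a strict parser would reject it unless a namespace declaration were added first.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the record above. Elements and attributes carrying the
# undeclared wicri: prefix are left out so that a strict parser accepts it.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Identifying the Original Contribution of a Document via Language Modeling</title>
<author>
<name sortKey="Shaparenko, Benyah" first="Benyah" last="Shaparenko">Benyah Shaparenko</name>
</author>
<author>
<name sortKey="Joachims, Thorsten" first="Thorsten" last="Joachims">Thorsten Joachims</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">ISTEX:76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-04174-7_23</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""


def summarize(xml_text):
    """Return title, authors, DOI, RBID and year from a record string."""
    root = ET.fromstring(xml_text)
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    pub_stmt = root.find("./TEI/teiHeader/fileDesc/publicationStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        # The idno elements are told apart by their type attribute.
        "doi": pub_stmt.findtext("idno[@type='doi']"),
        "rbid": pub_stmt.findtext("idno[@type='RBID']"),
        "year": pub_stmt.find("date").get("when"),
    }


info = summarize(RECORD)
for field, value in info.items():
    print(field, "=", value)
```

Selecting `idno` elements by their `type` attribute avoids depending on their order, which matters here because the record carries many `idno` entries side by side.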

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000828 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000828 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD
   |texte=   Identifying the Original Contribution of a Document via Language Modeling
}}
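
When links are needed for many records, the template call above can be produced from the record's fields. The sketch below is only an illustration: the helper `explor_link` is a hypothetical name, not a Wicri or Dilib tool, and the padding width is an assumption read off the layout of the template shown above. The parameter names (`wiki`, `area`, `flux`, `étape`, `type`, `clé`, `texte`) are copied from that template.

```python
def explor_link(wiki, area, flux, step, key_type, key, text):
    """Render an Explor lien template call from its seven fields."""
    fields = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", step),
        ("type", key_type),
        ("clé", key),
        ("texte", text),
    ]
    lines = ["{{Explor lien"]
    for name, value in fields:
        # Pad "name=" to 9 characters so the values line up in one column.
        lines.append(f"   |{name + '=':<9}{value}")
    lines.append("}}")
    return "\n".join(lines)


link = explor_link(
    wiki="Ticri/CIDE",
    area="OcrV1",
    flux="Main",
    step="Exploration",
    key_type="RBID",
    key="ISTEX:76942CF8EF4C6E7F34B05603DA66CA8CFA39E8AD",
    text="Identifying the Original Contribution of a Document via Language Modeling",
)
print(link)
```

The alignment is cosmetic: it only keeps generated calls visually consistent with the hand-written one above.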

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024